Abstract

In computer vision, an entity such as an image or video is often represented as a set of instance vectors, which can be SIFT, motion, or deep learning feature vectors extracted from different parts of that entity. Thus, it is essential to design efficient and effective methods to compare two sets of instance vectors. Existing methods such as FV, VLAD or Super Vectors have achieved excellent results. However, this paper shows that these methods are designed from a generative perspective, and that a discriminative method can be more effective in categorizing images or videos. The proposed D3 (discriminative distribution distance) method compares two sets as two distributions, and uses a directional total variation distance (DTVD) to measure how separated they are. Furthermore, a robust classifier-based method is proposed to estimate DTVD. The D3 method is evaluated in action and image recognition tasks and has achieved excellent accuracy and speed. D3 also has a synergy with FV. The combination of D3 and FV has advantages over D3, FV, and VLAD.

1. Introduction 

In visual recognition, an entity (object or video) is usually represented as a set of instance vectors. Each instance vector is extracted using part of the entity (e.g., a local window extracted from an image or a time-space subvolume extracted from a video). Various features have emerged as the state of the art for extracting instance vectors at different stages of recognition research, such as dense SIFT features [21], dense CENTRIST features [29] or CNN features for images [ 1 3], or (improved) dense trajectory features [28] or CNN features for videos [ ]. Although CNNs (or other deep learning methods) originally integrate visual representation and classification into one system [19, 15], recent works have shown that if multiple (a set of) CNN features are extracted from each entity and images or videos are classified based on these sets, higher accuracies can be obtained [8, 32, 3, 31].

Because most existing learning algorithms assume that an entity is represented as a vector instead of a set of vectors, we need to find a suitable visual representation that encodes the set of instance vectors into one single vector. It is desirable that the representation captures useful (i.e., discriminative) information from the set. Thus, comparing one entity (a set of instance vectors) to another can be divided into two steps: first represent the sets as two vectors, then find a suitable distance metric to compare the vectors. One useful and frequently used variant is to compare one entity to a set of entities (e.g., all training images, corresponding to a bigger union set formed by gathering the instance vectors in every image).

Since the $\ell_2$ distance (or correspondingly a linear SVM) is very efficient and has shown great accuracy in the second step, an effective visual representation that turns a set of instance vectors into one single vector (i.e., the first step) has been very important in visual recognition research. Many representations have been proposed, for example,

• Fisher Vector (FV) and VLAD. FV [25] is based on the idea of the Fisher kernel in machine learning [11]. It models the distribution of instance vectors in training entities using a Gaussian Mixture Model (GMM). Then, one training or testing entity is modeled generatively, by a vector which describes how the GMM can be modified to generate the instance vectors inside that entity. A GMM with $K$ components has three sets of parameters $(w_i, \mu_i, \Sigma_i)$, $1 \le i \le K$. VLAD [12], another popular visual representation, can be regarded as a special case of FV, by using only the $\mu$ parameters. The classic bag-of-visual-words (BOVW) [4] representation is also a special case of FV, using the $w$ parameters.

• Super-Vector. Instead of modeling the instance vectors as distributions, the Super-Vector [35] represents a set of instance vectors based on how they can be reconstructed from dictionary items. Super-Vector aims at reducing the reconstruction error, which is also a generative perspective. The output of Super-Vector has two parts, which are conceptually related to the $w$ and $\mu$ parameters in Fisher Vectors.

Both threads of methods have shown excellent accuracy in the literature. However, they both focus on modeling how one entity or one distribution is generated. Given that the task at hand is recognition, we argue that we need to pay more attention to how two entities or two distributions are separated. In other words, we need a visual representation that pays more attention to the discriminative side. We naturally expect that such a representation would be suitable for visual recognition tasks, whose objective is to properly separate entities belonging to different categories.

In this paper, we propose a discriminative distribution distance (D3) representation that converts a set of instance vectors into a vector representation. D3 explicitly considers two distributions: a density X which is estimated from the training set as a reference model, and another distribution Y formed by one entity. D3 then uses the distribution distance between X and Y as a discriminative representation for the entity Y. Technically, D3 makes the following contributions.

• We propose a directional total variation distance (DTVD) to measure the distance between X and Y, which contains more discriminative information than classic distribution distances by considering directions;

• Directly calculating DTVD is unstable and problematic because Y may be non-Gaussian and may contain only a few items. We propose to estimate DTVD in a discriminative manner, by calculating robust classification errors when we try to classify every dimension of X from Y;

• We also show that D3 and FV are complementary to each other. By combining D3 and FV, we can achieve an accuracy higher than D3, FV, and VLAD.

We will start by explaining closely related methods, then propose the directional distribution distance, its robust estimation, and the entire D3 pipeline in Sec. 2. Sec. 3 presents empirical results, and Sec. 4 concludes this paper.

2. Discriminative Distribution Distance 

In this section, we propose a discriminative distribution distance (abbreviated as D3) to compare two sets of observations, which leads to an efficient and effective visual representation.

2.1. Distribution distance: generative vs. discriminative

Given two objects $X$ and $Y$, each of which is represented as an unordered set of instance vectors, i.e., $X = \{x_1, \ldots, x_{n_X}\}$, $Y = \{y_1, \ldots, y_{n_Y}\}$, we are interested in finding $d(X, Y)$, the distance (or dissimilarity) between them. This task is frequently encountered in computer vision. For example, an image or a video is usually represented as a set of feature vectors extracted from various image patches or supervoxels.

In the Fisher Vector (FV) representation, a large set of instance vectors is extracted from training images or videos. We treat this set as $X$, and a Gaussian Mixture Model (GMM) $p_X$ with parameters $\lambda = \{(w_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$ is estimated from $X$. When a test image or video $y$ is presented, we extract its instance vectors and treat them as $Y$. The FV representation considers $X$ and $Y$ as generated from two underlying distributions $p_X$ and $p_Y$, and encodes $y$ as a vector $f$. This is a generative model, and each component in $f$ describes how each parameter in $\lambda$ should be modified such that $p_X$ fits the data $Y$ properly.

Specifically, the probability that $y_i$ is generated by the $k$-th Gaussian is

$$\gamma_i(k) = p(k \mid y_i, \lambda) = \frac{1}{Z}\, w_k\, p_k(y_i \mid \lambda), \tag{1}$$

where $Z$ is a normalization constant, and $p_k$ is the $k$-th Gaussian component with weight $w_k$ and parameters $(\mu_k, \Sigma_k)$. In FV, the GMM covariances are assumed to be diagonal, with the diagonal entries forming a vector $\sigma_k^2$. The trends of parameter change (gradients) that modify $p_X$ to fit $y$ are then (for all $1 \le k \le K$) [25]
(2)

(3) 

(4) 


which correspond to the $w$, $\mu$, $\sigma$ parameters for $p_X$, respectively. The image or video $y$ is then represented by a feature vector $f$, which concatenates $f_{w_k}$, $f_{\mu_k}$, and $f_{\sigma_k}$ for all $1 \le k \le K$.
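For orientation, the gradient statistics most commonly quoted in the improved Fisher Vector literature take the form below. This is only a hedged reminder written in the notation of Eq. 1: the symbol $n_Y$ (the number of instance vectors in $Y$) is introduced here as an assumption, and the normalization constants may differ from those in the equations cited from [25].

```latex
\begin{align*}
f_{w_k} &= \frac{1}{n_Y \sqrt{w_k}} \sum_{i=1}^{n_Y} \bigl(\gamma_i(k) - w_k\bigr), \\
f_{\mu_k} &= \frac{1}{n_Y \sqrt{w_k}} \sum_{i=1}^{n_Y} \gamma_i(k)\, \frac{y_i - \mu_k}{\sigma_k}, \\
f_{\sigma_k} &= \frac{1}{n_Y \sqrt{2 w_k}} \sum_{i=1}^{n_Y} \gamma_i(k) \left[ \frac{(y_i - \mu_k)^2}{\sigma_k^2} - 1 \right].
\end{align*}
```

Divisions and squares of vectors are taken element-wise, which is what makes each dimension of $f$ independent of the others once the soft assignments are fixed.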

We want to emphasize two observations based on the FV representation.

• The gradient vector $f$ is formed under the generative assumption that $Y$ can be modeled by $p_X$ if we are allowed to modify the parameter set $\lambda$. Since what we are really interested in is how far $X$ is from $Y$, we believe that a discriminative distance between $X$ and $Y$ is a better option. That is, in this paper we treat $X$ and $Y$ as sampled from two different distributions $p_X$ and $p_Y$, and find their distribution distance to encode the image or video $y$;

• Since diagonal $\Sigma_k$ are used, after the soft assignment probabilities $\gamma_i(k)$ are calculated, each dimension of $f$ is generated independently of any other dimension. Thus, in finding a suitable representation for $Y$, we only need to consider each dimension individually. The problem is then: given two sets of scalar values $X = \{x_1, x_2, \ldots\}$ and $Y = \{y_1, y_2, \ldots\}$ (sampled from 1-d distributions $p_X$ and $p_Y$, respectively), how do we properly compute $d(p_X, p_Y)$?

As a final note in this section, the VLAD and super vector representations can be interpreted as special cases of FV: VLAD uses the $f_{\mu_k}$ components of $f$, and super vectors use both $f_{w_k}$ and $f_{\mu_k}$. It is also a common practice to use only $f_{\mu_k}$ and $f_{\sigma_k}$ in FV implementations.

2.2. Directional Total Variation Distance 

We need to be more discriminative. Thus, we propose to explicitly consider two distributions X and Y, where X is a density of instance vectors estimated from the training set, and Y is from one (training or testing) entity. A representation of Y that encodes the distance between X and Y will contain useful discriminative information about Y.

A widely used distance that compares two distributions is the total variation distance, which is independent of the distributions' parameterizations. Let $\nu_X$ and $\nu_Y$ be two probability measures on a measurable space $(\Omega, \mathcal{B})$; the total variation distance is defined as

$$d_{TV}(\nu_X, \nu_Y) = \sup_{A \in \mathcal{B}} |\nu_X(A) - \nu_Y(A)|. \tag{5}$$

While this definition is rather formal, $d_{TV}$ has a more intuitive form for commonly used continuous distributions by Scheffé's Lemma [5]. For example, for two normal distributions with p.d.f. $p_X$ and $p_Y$, where $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$,

$$d_{TV}(p_X, p_Y) = \frac{1}{2} \int |p_X(u) - p_Y(u)|\, du. \tag{6}$$

As illustrated in Fig. 1a, it is half the summed area of the red and green regions, which clearly indicates how two distributions are separated from each other.

The classic total variation distance (Eq. 6), however, misses one important piece of information that captures the key difference between $p_X$ and $p_Y$, as shown in Fig. 1b. In Fig. 1b, $p_1$ and $p_2$ are symmetric with respect to the mean of $p$, thus we have $d_{TV}(p, p_1) = d_{TV}(p, p_2)$, in spite of the fact that $p_1$ and $p_2$ are far apart. The lack of directional information is responsible for this drawback. Thus, we propose a directional total variation distance (DTVD) as

$$d_{DTV}(p_X, p_Y) = \mathrm{sign}(\mu_Y - \mu_X) \times d_{TV}(p_X, p_Y). \tag{7}$$

DTVD is a signed distance. In Fig. 1b, we will (correctly) have $d_{DTV}(p, p_1) = -d_{DTV}(p, p_2)$, which clearly signifies the difference between $p_1$ and $p_2$. Obviously the distance function $d_{DTV}$ is not a metric, because it is neither non-negative nor symmetric.
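The symmetry argument can be checked numerically. The following is a minimal sketch, assuming 1-d Gaussians and a simple midpoint-rule integration of Eq. 6; the function names and the integration range of ten standard deviations are illustrative choices, not part of the method.

```python
import math

def normal_pdf(u, mu, sigma):
    """Density of N(mu, sigma^2) at u."""
    return math.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def tv_distance(mu_x, sigma_x, mu_y, sigma_y, steps=20000):
    """Approximate Eq. 6, 0.5 * integral of |p_X - p_Y|, with the midpoint rule."""
    lo = min(mu_x - 10 * sigma_x, mu_y - 10 * sigma_y)
    hi = max(mu_x + 10 * sigma_x, mu_y + 10 * sigma_y)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * h
        total += abs(normal_pdf(u, mu_x, sigma_x) - normal_pdf(u, mu_y, sigma_y))
    return 0.5 * total * h

def dtv_distance(mu_x, sigma_x, mu_y, sigma_y):
    """Eq. 7: attach the sign of (mu_Y - mu_X) to the total variation distance."""
    sign = (mu_y > mu_x) - (mu_y < mu_x)
    return sign * tv_distance(mu_x, sigma_x, mu_y, sigma_y)

if __name__ == "__main__":
    # p = N(0, 1); p1 and p2 are mirror images of each other around the mean of p.
    print(tv_distance(0, 1, -2, 1), tv_distance(0, 1, 2, 1))    # equal magnitudes
    print(dtv_distance(0, 1, -2, 1), dtv_distance(0, 1, 2, 1))  # opposite signs
```

The unsigned distance cannot tell the two mirrored distributions apart, while the signed version separates them.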



Figure 1. Illustration of the total variation distance. Panel 1a illustrates $d_{TV}$ for two Gaussians, and panel 1b reveals that direction is essential.

2.3. Robust estimation of the DTVD 

For two Gaussians $p_X$ and $p_Y$, their p.d.f.s will have two intersections if $\sigma_X \ne \sigma_Y$. For example, in Fig. 1a the second intersection is at the far right end of the x-axis. A closed-form solution to calculate $d_{TV}$ based on both intersections is available [5]. However, this closed-form solution led to a serious performance drop when used for visual recognition in our experiments. We conjecture that two reasons have caused this issue:

• The distributions are not necessarily normal. As shown in Fig. 2, the typical example of $p_X$ (in Fig. 2a) is generated from many training instances; its shape resembles that of a Gaussian, but has a sharper peak. $p_Y$, which is generated from a single image, deviates from a normal distribution;

• Since the set $Y$ (which is extracted from a single image or video, cf. Fig. 2b) usually contains a small number of instance vectors, the estimation of its distribution parameters is unstable, and hence so is $d_{DTV}(p_X, p_Y)$. Thus, we need a more robust way to estimate the distribution distance.

Figure 2. Typical distribution of feature values. 2a is calculated based on features used to generate the codebook, and 2b is from a single image. The red curve is a normal distribution estimated from the same data. This figure is generated using bag of dense SIFT on the Scene 15 dataset with $K = 64$ VLAD encoding. The dimension shown is the 37-th dimension in the 37-th cluster of the codebook.

Our key insight again arises from the discriminative perspective. It is obvious that the total variation distance $d_{TV}$ is equivalent to one minus the Bayes error of a binary classification problem, where the two classes have equal priors and follow $p_X$ and $p_Y$, respectively. Thus, we can estimate $d_{TV}$ (hence $d_{DTV}$) by robustly estimating the classification error between the two sets of examples $X$ and $Y$. Note that this task is easy since $X$ and $Y$ only contain scalar examples.

We adopt the minimax probability machine (MPM) [17] to estimate the classification error. MPM is robust because it minimizes the maximum probability of misclassification, hence the name minimax. Given examples $X$ with mean $\mu_X$ and covariance $\Sigma_X$ and examples $Y$ with $\mu_Y$ and $\Sigma_Y$, the classifier boundary $a^T x - b = 0$ is determined by the MPM problem

$$\kappa_*^{-1} = \min_a \sqrt{a^T \Sigma_X a} + \sqrt{a^T \Sigma_Y a} \tag{8}$$

$$\text{s.t.} \quad a^T(\mu_X - \mu_Y) = 1, \tag{9}$$

$$b_* = a_*^T \mu_X - \kappa_* \sqrt{a_*^T \Sigma_X a_*}. \tag{10}$$

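The problem in Eqs. 8 and 9 is a second-order cone program (SOCP) that can be solved by an iterative algorithm. However, since we are dealing with scalar examples that are assumed to follow normal distributions, it has a closed-form solution. Noting that $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, we immediately get the boundary $a_* x - b_* = 0$, where

$$a_* = \frac{1}{\mu_X - \mu_Y}, \qquad b_* = \frac{1}{\mu_X - \mu_Y} \times \frac{\mu_X \sigma_Y + \mu_Y \sigma_X}{\sigma_X + \sigma_Y}, \tag{11}$$

$$\kappa_* = \frac{|\mu_X - \mu_Y|}{\sigma_X + \sigma_Y}. \tag{12}$$

That is, the two 1-d distributions $p_X$ and $p_Y$ are classified at the threshold value

$$T = \frac{b_*}{a_*} = \frac{\mu_X \sigma_Y + \mu_Y \sigma_X}{\sigma_X + \sigma_Y}. \tag{13}$$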
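The scalar closed form is small enough to check by hand. Below is a minimal sketch, assuming 1-d inputs with distinct means and positive standard deviations; the function name `mpm_scalar` is hypothetical, and the expression for kappa follows from substituting the scalar solution into the MPM objective rather than from any library.

```python
def mpm_scalar(mu_x, sigma_x, mu_y, sigma_y):
    """Closed-form 1-d MPM boundary a*x - b = 0 and its threshold T = b / a."""
    if mu_x == mu_y:
        raise ValueError("the constraint a * (mu_x - mu_y) = 1 needs distinct means")
    a = 1.0 / (mu_x - mu_y)
    threshold = (mu_x * sigma_y + mu_y * sigma_x) / (sigma_x + sigma_y)
    b = a * threshold
    kappa = abs(mu_x - mu_y) / (sigma_x + sigma_y)
    return a, b, kappa, threshold

if __name__ == "__main__":
    a, b, kappa, t = mpm_scalar(0.0, 1.0, 3.0, 2.0)
    # The threshold sits the same number of standard deviations (kappa)
    # away from each mean, which is what makes the boundary minimax.
    print(t, (t - 0.0) / 1.0, (3.0 - t) / 2.0, kappa)
```

Because the threshold is a weighted average of the two means with positive weights, it always lies between them.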

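If we re-use Fig. 1a and (approximately) assume that the red, blue, and green areas intersect at $T = \frac{\mu_X \sigma_Y + \mu_Y \sigma_X}{\sigma_X + \sigma_Y}$, which is guaranteed to lie between $\mu_X$ and $\mu_Y$, then the area of the blue region is

$$\text{Area} = \Phi\left(\frac{T - \mu_Y}{\sigma_Y}\right) + 1 - \Phi\left(\frac{T - \mu_X}{\sigma_X}\right), \tag{14}$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt \tag{15}$$

is the cumulative distribution function (c.d.f.) of the standard normal distribution $N(0, 1)$. And, we have

$$d_{DTV}(p_X, p_Y) = 2 - 2\,\text{Area} = 4\Phi\left(\frac{\mu_Y - \mu_X}{\sigma_X + \sigma_Y}\right) - 2, \tag{16}$$

making use of the fact that

$$\frac{T - \mu_X}{\sigma_X} = \frac{\mu_Y - \mu_X}{\sigma_X + \sigma_Y} = \frac{\mu_Y - T}{\sigma_Y},$$

and the property of $\Phi$ that $\Phi(-x) = 1 - \Phi(x)$.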

Two points are worth mentioning about Eq. 16.

• Although our derivation and Fig. 1a assume $\mu_X < \mu_Y$, it is easy to derive that Eq. 16 still holds when $\mu_X > \mu_Y$. Moreover, it always has the same sign as $\mu_Y - \mu_X$. Hence, Eq. 16 computes $d_{DTV}$ instead of $d_{TV}$; its range is $[-2, 2]$.

• In practice we use the error function. The error function is defined as

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt, \tag{17}$$

and it satisfies

$$\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right). \tag{18}$$

Thus, we have

$$d_{DTV}(p_X, p_Y) = 2\,\mathrm{erf}\left(\frac{\mu_Y - \mu_X}{\sqrt{2}(\sigma_X + \sigma_Y)}\right). \tag{19}$$

The error function erf is built-in and efficient in most major programming languages, which facilitates the calculation of $d_{DTV}$ using Eq. 19.
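As a quick check that the c.d.f. form and the error-function form agree, here is a minimal sketch using Python's standard `math.erf`; the function names are illustrative, and scalar means and positive standard deviations are assumed.

```python
import math

def std_normal_cdf(x):
    """Standard normal c.d.f. written through the error function (Eq. 18)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dtvd_via_cdf(mu_x, sigma_x, mu_y, sigma_y):
    """Eq. 16: 4 * Phi((mu_Y - mu_X) / (sigma_X + sigma_Y)) - 2."""
    return 4.0 * std_normal_cdf((mu_y - mu_x) / (sigma_x + sigma_y)) - 2.0

def dtvd_via_erf(mu_x, sigma_x, mu_y, sigma_y):
    """Eq. 19: 2 * erf((mu_Y - mu_X) / (sqrt(2) * (sigma_X + sigma_Y)))."""
    return 2.0 * math.erf((mu_y - mu_x) / (math.sqrt(2.0) * (sigma_x + sigma_y)))

if __name__ == "__main__":
    for args in [(0.0, 1.0, 3.0, 2.0), (3.0, 2.0, 0.0, 1.0), (1.0, 0.5, 1.0, 4.0)]:
        print(args, dtvd_via_cdf(*args), dtvd_via_erf(*args))
```

Swapping the roles of the two distributions flips the sign of the result, and equal means give zero regardless of the standard deviations.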

We also want to note that there has been research on modeling the discriminative distance between two sets of instance vectors. In [23], non-parametric kernels are estimated from two sets, and the Hellinger distance or the Rényi-α divergence is used to measure the distance between the two distributions. This method, however, suffers from one major limitation. Non-parametric kernel estimation is very time consuming: it took 3.3 days on a subset of the Scene 15 dataset, which renders it impractical for large problems. As a direct comparison, D3 requires less than 2 minutes.

Algorithm 1 Visual representation using D3

Input: An image or video $Y = \{y_1, y_2, \ldots\}$; a dictionary (visual codebook) with size $K$, cluster means $\mu_k$, and standard deviations $\sigma_k$ ($1 \le k \le K$)
for $i = 1, 2, \ldots, K$ do
    $Y' = \{y_j \mid y_j \in Y, \arg\min_{1 \le k \le K} \|y_j - \mu_k\| = i\}$
    Compute the mean and standard deviation vectors of the set $Y'$, denoted as $\mu'$ and $\sigma'$, respectively
    $f_i = \mathrm{erf}\left(\frac{\mu' - \mu_i}{\sqrt{2}(\sigma' + \sigma_i)}\right)$
    Normalize $f_i$
end for
Note that the erf function is applied to every component of the vector individually
Output: The new representation $f \in \mathbb{R}^{dK}$, the concatenation of $f_1, \ldots, f_K$

2.4. The pipeline using $d_{DTV}$ for visual recognition

We assume that an image or video $y$ is represented as a bag of instance feature vectors $Y = \{y_1, y_2, \ldots\}$, where each $y_j \in \mathbb{R}^d$. The instance vectors are usually extracted as dense SIFT vectors or deep learning (CNN) features for images, or dense trajectory features or deep learning features for videos, or other representations that use a set of vectors to represent an entity.

The pipeline that uses $d_{DTV}$ to generate an image or video representation follows two steps.

• Dictionary generation. For simplicity and computational efficiency, we collect a large set of instance vectors from the training set, and then use the k-means algorithm to generate a dictionary that partitions the space of instance vectors into $K$ regions. We compute the mean and standard deviation of the instance vectors inside cluster $k$ as $\mu_k$ and $\sigma_k$, for all $1 \le k \le K$. Values in the standard deviation vector $\sigma_k$ are computed for every dimension independently;

• Visual representation. Given an image or video $y$, we use Algorithm 1 to convert it to a vector representation. Note that since we normalize every per-cluster vector $f_i$ in Algorithm 1, the constant factor ('2') in Eq. 19 is not necessary and is thus omitted.

In Algorithm 1, we use the k-means algorithm to generate a visual codebook, and an instance vector is hard-assigned to one visual code word. A GMM can also be used as a soft codebook, similar to what is done in FV. However, a GMM has higher costs in generating both the dictionary and the visual representation. Thus, we use k-means to generate a codebook in D3. D3 and VLAD then have very similar frameworks, and it is interesting to compare D3 with both VLAD and FV.
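The encoding step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `d3_encode` is hypothetical, each per-cluster block is normalized with the ℓ2 norm, population standard deviations are used, clusters that receive no instance vector are encoded as zeros, and a small `eps` guards against a zero denominator.

```python
import math

def d3_encode(instances, means, stds, eps=1e-12):
    """Encode a set of d-dim instance vectors against a K-word codebook.

    instances: list of d-dim lists; means, stds: K lists of d-dim lists.
    Returns a flat list with d * K entries.
    """
    K, d = len(means), len(means[0])
    # Hard-assign every instance vector to its nearest code word.
    groups = [[] for _ in range(K)]
    for y in instances:
        nearest = min(range(K),
                      key=lambda k: sum((a - b) ** 2 for a, b in zip(y, means[k])))
        groups[nearest].append(y)
    representation = []
    for k in range(K):
        block = [0.0] * d
        if groups[k]:
            n = len(groups[k])
            for j in range(d):
                values = [y[j] for y in groups[k]]
                mu = sum(values) / n
                sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
                # Per-dimension directional distance (Eq. 19 without the factor 2).
                denom = math.sqrt(2.0) * (sigma + stds[k][j]) + eps
                block[j] = math.erf((mu - means[k][j]) / denom)
            norm = math.sqrt(sum(v * v for v in block))
            if norm > 0.0:
                block = [v / norm for v in block]
        representation.extend(block)
    return representation

if __name__ == "__main__":
    codebook_means = [[0.0, 0.0], [10.0, 10.0]]
    codebook_stds = [[1.0, 1.0], [1.0, 1.0]]
    image = [[1.0, 0.0], [1.0, 0.0], [9.0, 10.0], [11.0, 12.0]]
    print(d3_encode(image, codebook_means, codebook_stds))
```

The only difference from a VLAD-style encoder in this sketch is the per-dimension erf of a scaled mean difference in place of a sum of residuals, which is why the two have similar cost.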

2.5. Efficiency and hybrid representation 

Since the error function implementation is efficient, the computational cost of D3 is roughly the same as that of VLAD, which is much more efficient than the FV method. The evaluation in [22] showed that the time for VLAD is less than 5% of that of FV. Thus, a visual representation using D3 is efficient to compute.

It is also worth noting that although D3 and FV both use first- and second-order statistics of an image or video Y and compare these statistics with those computed from the training set, they use these statistics in very different ways. Thus, different information (discriminative vs. generative) is extracted by D3 and FV. By computing the D3 and FV representations separately and then concatenating them to form a hybrid one, we can get higher recognition accuracy than both D3 and FV, as will be shown in Sec. 3. Suppose we form a $dK$-dimensional D3 vector and a $dK$-dimensional FV vector; the hybrid representation will then be $2dK$-dimensional. However, its computational time will be only roughly half of that of forming a $2dK$-dimensional FV representation.

A final note is about higher-order VLAD. VLAD only uses first-order statistics (the mean) of the set of instance vectors. In [22], higher-order statistics (variance and skewness) are added to effectively improve VLAD. Since this method triples the number of dimensions of VLAD (with the same K) and its accuracy is not as high as that of FV, we will not empirically compare D3 with this method in this paper. However, because D3 does not specify how a codebook is generated, the supervised codebook generation method of [22] can be adopted to further improve D3 in the future.

3. Experimental Results 

To compare the representations fairly, we compare them using the same number of dimensions. For example, the following setups will be compared to each other.

• D3 (or VLAD) with $K_1 = 256$ visual words; the representation has $dK_1$ dimensions;

• FV with $K_2 = 128$ components ($2dK_2 = dK_1$);

• A mixture of D3 and FV with $K_2 = 128$ in D3 and $K_3 = 64$ in FV ($dK_2 + 2dK_3 = 2dK_2 = dK_1$).

We will use D3's $K$ size to indicate the size of all the above setups (i.e., $K = 256$ in this example).
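The dimension matching above is simple arithmetic, sketched here for concreteness. The helper names and the instance dimension `d = 128` are illustrative assumptions; the FV count assumes only the mean and standard deviation gradients are kept.

```python
def d3_dims(d, k):
    """D3 (and VLAD) produce one d-dim block per visual word."""
    return d * k

def fv_dims(d, k):
    """FV with mean and standard-deviation gradients: two d-dim blocks per component."""
    return 2 * d * k

def hybrid_dims(d, k_d3, k_fv):
    """Concatenation of a D3 vector and an FV vector."""
    return d3_dims(d, k_d3) + fv_dims(d, k_fv)

if __name__ == "__main__":
    d = 128  # hypothetical instance-vector dimension
    print(d3_dims(d, 256), fv_dims(d, 128), hybrid_dims(d, 128, 64))
```

All three setups yield the same length for any `d`, which is what allows them to share one column in the result tables.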












Figure 3. Distribution of per-dimension mutual information. 3a shows the quantile values in the full range, and 3b shows the frequency of high (most discriminative) MI values.

Three types of experiments are performed. First, a small image dataset is used to study the properties of D3 (Sec. 3.1). Then, D3 is evaluated in action recognition (with the ITF features, in Sec. 3.2) and in image recognition (with CNN features, in Sec. 3.3). Discussions are in Sec. 3.4.

3.1. Why use D3? 

We first study the properties of the proposed D3 representation, and shed some light on why it is an effective way to encode the distance between two sets of dense SIFT instance vectors.

Using the training images of the Scene 15 dataset [18] and dense SIFT features (with step size 4), we compare the per-dimension discriminative power of two representations (D3 and VLAD).

Suppose $X$ is the D3 or VLAD representation of a set of images with corresponding image labels $l$, whose $i$-th dimension forms a vector $x_{:i}$. It is natural to measure the discriminative power of the $i$-th dimension by computing the mutual information between $x_{:i}$ and $l$, i.e., $\mathrm{MI}(x_{:i}, l)$ [33]. We use the 2-bit method in [33] to quantize $x_{:i}$ and compute the mutual information. The distribution of all dimensions' MI values is shown in Fig. 3.
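A per-dimension mutual information score can be sketched as below. The quantization here is an assumption made for illustration: each value is mapped to one of four levels using the quartiles of that dimension, which may differ from the 2-bit method of [33]; the function names are hypothetical.

```python
import math
from collections import Counter

def quantize_2bit(values):
    """Map each value to a level in {0, 1, 2, 3} using quartile cut points (assumed scheme)."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(n * q) // 4] for q in (1, 2, 3)]
    return [sum(v >= c for c in cuts) for v in values]

def mutual_information(codes, labels):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(codes)
    joint = Counter(zip(codes, labels))
    count_codes = Counter(codes)
    count_labels = Counter(labels)
    mi = 0.0
    for (c, l), n_cl in joint.items():
        # p(c, l) * log2( p(c, l) / (p(c) * p(l)) ), written with raw counts.
        mi += (n_cl / n) * math.log2(n_cl * n / (count_codes[c] * count_labels[l]))
    return mi

if __name__ == "__main__":
    feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # one dimension over 8 images
    labels = [0, 0, 1, 1, 2, 2, 3, 3]
    print(mutual_information(quantize_2bit(feature), labels))
```

Running this for every dimension of a representation and sorting the scores gives the kind of quantile curve used to compare the two encodings.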

Fig. 3a shows the MI quantile values. For example, when the x-axis is 0.5, the D3 and VLAD curves have values 4.8292 and 4.7966, meaning that the median of D3's MI values is above that of VLAD's by 0.0326. Similarly, when the y-axis is 4.8292, the D3 and VLAD curves have corresponding quantile (x-axis) values 0.5 and 0.3888, meaning that 50% of D3's dimensions have MI values above 4.8292, but only 38.88% of VLAD's dimensions reach this discriminative power. Since the D3 (red solid) curve is almost consistently above the VLAD (dashed black) curve, D3's dimensions have higher discriminative power than VLAD's.

Fig. 3b shows the frequencies of dimensions that have the highest MI values. Although VLAD has a few dimensions with higher MI values than D3, D3 clearly has many more discriminative dimensions. The total numbers of dimensions with MI values > 5.25 are 2020 and 1544 for D3 and VLAD, respectively, a 30.8% advantage for D3. Since every single dimension is too weak to classify the image on its own, it is more important to have many dimensions with good discriminative power than to have a few that are only slightly more discriminative.

3.2. Action recognition results 

We first show experimental results for action recognition. A set of improved trajectory features (ITF) [28] are extracted and then converted to D3, VLAD, FV, and two hybrid representations (D3+FV and VLAD+FV). The default parameters are used to extract ITF features.

We experimented on three datasets: UCF 101 [27], HMDB 51 [16] and Youtube [20]. For UCF 101, the three splits of train and test videos in [14] are used and we report the average accuracy. This dataset has 13320 videos and 101 action categories. The HMDB 51 dataset has 51 actions in 6766 clips. We use the original (not stabilized) videos and follow [16] to report the average accuracy of its 3 predefined splits of training / testing videos. Youtube is a small-scale dataset with 11 action types. There are 25 groups in each action category and 4 videos are used in each group. Following the original protocol, we report the average of the 25-fold leave-one-group-out cross-validation accuracy rates. Results on these datasets are reported in Table 1. We summarize the experimental results into the following observations.

D3 is better than VLAD in almost all cases. In the 12 comparisons between D3 and VLAD, D3 wins in 11 cases. D3 often has a margin even if it uses half the number of dimensions of VLAD (e.g., D3 with K = 128 vs. VLAD with K = 256). This shows that the D3 representation is effective in capturing useful information for classification.

D3 bridges the gap between VLAD and FV. In practice we often see that FV has higher accuracy than VLAD, but also much higher computational costs. D3 has roughly the same speed as VLAD, but its accuracy is close to that of FV. Compared to VLAD, whose accuracies are usually 2-3% lower than FV's, D3 has accuracy rates much closer to FV's. On average, D3 is 1% worse than FV. On the Youtube dataset D3 is better than FV (91.55% vs. 91.00%). Given the computational benefits of D3, it can act as an attractive alternative to FV.

The hybrid D3 / FV representation (nearly) consistently outperforms all other methods. We show that the hybrid methods are the best performers in Table 1. The D3+FV representation is especially effective: it is the winner in 8 out of 9 cases. With K = 128 in the Youtube set being the only exception, D3+FV consistently beats other methods, including FV and VLAD+FV.

Table 1. Action recognition accuracy (%) comparisons. Note that the results in one column are compared with the same number of dimensions in the representations. For example, the column K = 256 means that K = 256 for D3 and VLAD, K = 128 for FV, and in the hybrid representation, K = 128 for D3 or VLAD combined with K = 64 for FV. Note that K = 64 results for the hybrid representations are not presented (marked "-"). The best results are shown in bold face.

            UCF 101                       HMDB 51                       Youtube
K           512    256    128    64       512    256    128    64       512    256    128    64
D3          84.35  84.32  83.03  81.34    56.14  55.29  54.71  51.70    89.91  91.55  91.09  90.36
VLAD        82.81  82.54  81.59  79.78    55.45  55.14  53.92  50.22    90.00  89.73  89.18  89.09
FV          85.23  84.82  83.80  82.48    58.13  57.34  55.88  53.20    91.00  91.00  90.73  90.45
D3+FV       85.92  85.44  84.20  -        58.34  57.63  56.58  -        91.73  91.36  90.45  -
VLAD+FV     85.23  84.54  83.52  -        58.13  57.60  55.64  -        90.91  91.36  90.82  -

Two points are worth pointing out. First, the success of D3+FV shows that the information encoded in D3 and in FV, although both use first- and second-order statistics of the two distributions, is complementary. The hybrid of the two outperforms both D3 and FV. Since the running time of D3+FV is only roughly half of that of FV, D3+FV is attractive in both speed and accuracy. Second, VLAD+FV is clearly inferior to D3+FV. Its accuracy is very similar to that of FV, but lower than that of D3+FV in most cases.

3.3. Image recognition results 

Now we test how D3 (and the comparison methods) work with instance vectors that are extracted by state-of-the-art deep learning methods. To extract instance vectors, we use the DSP (deep spatial pyramid) method [1], which spatially integrates deep fully convolutional networks. A set of instance vectors is efficiently extracted, each of which corresponds to a spatial region (i.e., receptive field) in the original image. The CNN model we use is imagenet-vgg-verydeep-16 in [26] up to the last convolutional layer, and the input image is resized such that its shortest edge is no smaller than 314 pixels and its longest edge is no larger than 1120 pixels. Six spatial regions are used, corresponding to the level 1 and 0 regions in [29]. [1] finds that FV or VLAD usually achieves optimal performance with very small K sizes in DSP. Hence, we test $K \in \{4, 8\}$.

The following image datasets are used. 

• Scene 15 [18]. It contains 15 categories of scene images. We use 100 training images per category; the rest are for testing.

• MIT indoor 67 [24]. It has 15620 images in 67 indoor scene types. We use the train/test split provided in [24].

• Caltech 101 [6]. It consists of 9K images in 101 object categories plus a background category. We train on 30 and test on 50 images per category.

• Caltech 256 [9]. It is a superset of Caltech 101, with 31K images, and 256 object categories plus 1 background category. We train on 60 images per category; the rest are for testing.

• SUN 397 [30]. It is a large-scale scene recognition dataset, with 397 categories and at least 100 images per category. We use the first 3 train/test splits of [30].

Except for the indoor and SUN datasets, we run 3 random train/test splits in each dataset. Average accuracy rates on these datasets are reported in Table 2. As shown by the standard deviation numbers in Table 2, the deep learning instance vectors are stable and the standard deviations are small in most cases. Thus, we tested with 3 random train/test splits instead of more (e.g., 5 or 10).

D3 and D3+FV have shown excellent results when combined with instance vectors extracted by deep nets. We have the following key observations from Table 2, which mostly coincide with the observations concerning action recognition in Table 1. The last row in Table 2 shows the current state-of-the-art recognition accuracy in the literature, achieved by various systems that depend on deep learning and use the same evaluation protocol.

D3 is slightly better than FV. D3 is better than FV in 3 datasets (Scene 15, indoor 67 and SUN 397), but worse than FV in the two Caltech datasets. It is worth noting that D3's accuracy is higher than that of FV by a larger margin in indoor 67 (1-2%) and SUN 397 (1.5-2.2%), while FV is only higher than D3 by 0.3-0.7% in the Caltech 101 and 256 datasets. Another important observation is that the wins and losses are consistent among the train/test splits. In other words, if D3 wins (loses) in one dataset, it wins (loses) consistently in all three splits.* Thus, the CNN instance vectors lead to stable comparison results, and we believe 3 train/test splits are enough to compare these algorithms.

VLAD is better than both D3 and FV, but D3 bridges the gap between VLAD and FV. Although FV usually outperforms VLAD in image classification and retrieval using dense SIFT features and in the action recognition results of Table 1, a reversed trend is shown in Table 2 using CNN instance vectors. VLAD is almost consistently better than FV, up to 3.2% higher in the SUN 397 dataset. The accuracy of D3, however, is much closer to that of VLAD than FV's accuracy is. D3 is usually 0.3%-0.6% lower than VLAD, with only two cases up to 1.1% (K = 8 in Caltech 256 and SUN 397).

* Detailed per-split accuracy numbers are omitted.

Table 2. Image recognition accuracy (%) comparisons. The definition of K is the same as that used in Table 1. The best results are shown in bold face. Standard deviations are shown after the ± sign.

                  Scene 15                  MIT indoor 67    Caltech 101               Caltech 256               SUN 397
                  K = 4       K = 8         K = 4   K = 8    K = 4       K = 8         K = 4       K = 8         K = 4       K = 8
D3                92.34±0.23  92.10±0.65    77.31   77.76    93.60±0.17  93.80±0.58    83.15±0.15  82.92±0.09    59.93±0.24  60.22±0.07
VLAD              92.58±0.60  92.61±0.42    77.61   78.13    94.20±0.39  94.11±0.57    84.01±0.02  84.00±0.10    60.61±0.25  61.22±0.33
FV                91.96±0.40  91.53±0.56    75.97   75.82    94.32±0.51  94.10±0.33    83.75±0.16  83.40±0.13    58.40±0.12  57.97±0.28
D3+FV             92.83±0.55  92.82±0.31    77.09   77.99    94.72±0.51  94.51±0.44    84.77±0.12  84.62±0.15    61.48±0.22  61.38±0.52
VLAD+FV           92.82±0.52  92.76±0.56    77.54   78.06    94.71±0.41  94.45±0.51    84.18±0.51  84.61±0.16    61.32±0.26  61.83±0.27
D3+VLAD           92.82±0.30  92.92±0.19    77.01   77.91    94.59±0.54  94.45±0.41    84.09±0.25  84.31±0.14    60.38±0.30  61.48±0.32
State of the art  91.59±0.48 [34]           77.56 [8]        93.42±0.50 [10]           77.61±0.12 [2]            53.86±0.21 [34]

The hybrid methods are all effective, and D3+FV is
the overall winning method. The second part of Table 2
presents results of hybrid methods. Beyond D3+FV and
VLAD+FV, we also add the results of D3+VLAD, because
VLAD is the winner in the first part of Table 2. Excluding
the MIT indoor 67 dataset, all hybrid methods
have higher accuracy rates than every individual method.
Among the hybrid methods, D3+FV is again the overall
winner. It has the highest accuracy in 6 cases, while
VLAD+FV and D3+VLAD have only one each. When
comparing D3+FV with D3, FV or VLAD in detail, this hybrid
method has higher accuracy than any single method in all
train/test splits in all 36 comparisons (4 datasets excluding
the indoor 67 dataset × 3 individual representations × 3
train/test splits). The MIT indoor 67 dataset is a special
case, where VLAD is better than all other methods. It is
not yet clear to us what characteristic of this dataset makes
it particularly suitable for VLAD.
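
The win counts above can be tallied directly from the mean accuracies in Table 2. The sketch below is only an illustrative check: it ignores standard deviations and per-split numbers, and the names `table2`, `winners` and `wins` are our own.

```python
from collections import Counter

# Mean accuracies (percent) from Table 2, in column order:
# Scene 15, MIT indoor 67, Caltech 101, Caltech 256, SUN 397,
# each with K = 4 followed by K = 8 (10 columns in total).
table2 = {
    "D3":      [92.34, 92.10, 77.31, 77.76, 93.60, 93.80, 83.15, 82.92, 59.93, 60.22],
    "VLAD":    [92.58, 92.61, 77.61, 78.13, 94.20, 94.11, 84.01, 84.00, 60.61, 61.22],
    "FV":      [91.96, 91.53, 75.97, 75.82, 94.32, 94.10, 83.75, 83.40, 58.40, 57.97],
    "D3+FV":   [92.83, 92.82, 77.09, 77.99, 94.72, 94.51, 84.77, 84.62, 61.48, 61.38],
    "VLAD+FV": [92.82, 92.76, 77.54, 78.06, 94.71, 94.45, 84.18, 84.61, 61.32, 61.83],
    "D3+VLAD": [92.82, 92.92, 77.01, 77.91, 94.59, 94.45, 84.09, 84.31, 60.38, 61.48],
}

n_cols = 10
# For every column, pick the method with the highest mean accuracy.
winners = [max(table2, key=lambda m: table2[m][c]) for c in range(n_cols)]
wins = Counter(winners)

for method, count in wins.most_common():
    print(f"{method:8s} wins {count} column(s)")
```

With these numbers, D3+FV tops 6 of the 10 columns, VLAD tops the two MIT indoor 67 columns, and VLAD+FV and D3+VLAD top one column each.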

The fact that D3 is in general inferior to VLAD in
this setup also indicates that CNN instance vectors have
different characteristics from the dense SIFT vectors (cf.
Sec. 3.1), for which VLAD is inferior to D3.

This might be caused by the fact that D3 and VLAD used
very small K values (K = 4 or 8) with CNN instance
vectors, compared to K > 64 in Sec. 3.1. Hence, both
methods now have far fewer dimensions, and a few
VLAD dimensions with the highest discriminative power may
lead to better performance than D3. We leave a more
detailed analysis of this observation to future work.

Significantly higher accuracy than state-of-the-art,
especially in those difficult datasets. DSP [1] (with D3 or
other individual representation methods) is a strong
baseline, which already outperforms the previous state-of-the-art
in the literature (shown in the last row of Table 2). The
hybrid method D3+FV leads to even better performance, e.g.,
its accuracy is 7.2% higher than [2] for the Caltech 256
dataset,† and 7.6% higher than the place deep model of [34]
for SUN 397.

† [26] reported an average recall rate of 86.2% for Caltech 256. DSP's
average recall is 89.12% and D3+FV is 90.25% (K = 4).
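
The gains over the last row of Table 2 can be reproduced with a few lines of arithmetic. This is a small sketch that takes, for each dataset, the better of the two D3+FV mean accuracies (K = 4 or K = 8); the dictionary names are hypothetical and chosen for illustration.

```python
# Previous state-of-the-art mean accuracy (percent), last row of Table 2.
state_of_the_art = {
    "Scene 15": 91.59,
    "MIT indoor 67": 77.56,
    "Caltech 101": 93.42,
    "Caltech 256": 77.61,
    "SUN 397": 53.86,
}
# D3+FV mean accuracy (percent) from Table 2 as (K = 4, K = 8).
d3_fv = {
    "Scene 15": (92.83, 92.82),
    "MIT indoor 67": (77.09, 77.99),
    "Caltech 101": (94.72, 94.51),
    "Caltech 256": (84.77, 84.62),
    "SUN 397": (61.48, 61.38),
}

# Gain of the better D3+FV setting over the prior result, in percentage points.
gain = {
    name: round(max(d3_fv[name]) - state_of_the_art[name], 1)
    for name in state_of_the_art
}

for name, g in gain.items():
    print(f"{name:14s} +{g:.1f}")
```

The two largest gains fall on Caltech 256 and SUN 397, the datasets with the lowest absolute accuracy in the table.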


3.4. Discussions 

Overall, the proposed D3 representation method has the 
following properties: 

• D3 is discriminative, efficient, and stable. D3 is not
  the individual representation method that leads to the
  highest accuracy: FV is the best in our action
  recognition experiments with ITF instance vectors, while
  VLAD is the best in our image categorization
  experiments using CNN features. D3 is, however, the most
  stable one. It is only slightly worse than FV in action
  recognition and slightly worse than VLAD in image
  categorization. Although VLAD is outperformed by FV by
  a large margin in action recognition (Table 1), and vice
  versa for image categorization (Table 2), D3 has stably
  achieved high accuracy rates. D3 is also as efficient as
  VLAD, and is much faster than the FV method;

• D3+FV is the overall winning method. Using the
  same number of dimensions for all individual and
  hybrid methods, D3+FV has shown the best
  performance, which indicates that the information encoded by
  D3 and FV forms a synergy. Since the FV part of D3+FV
  uses only half as many Gaussian components as the
  individual FV, D3+FV is still more efficient than
  FV alone.

In short, D3 and D3+FV are effective and efficient in
encoding entities that are represented as sets of instance vectors.

4. Conclusions and Future Work 

We proposed the Discriminative Distribution Distance
(D3) method to encode an entity (which comprises a set
of instance vectors) into a vector representation. Unlike
existing methods such as FV, VLAD or Super Vectors, which are
designed from a generative perspective, D3 is based on
discriminative ideas. We proposed to use directional distances
to measure how two distributions (sets of vectors) differ
from each other, and to use the robust MPM
classifier to estimate this distance.

These discriminative design choices lead to excellent
classification accuracy of the proposed D3 representation,
which is verified by extensive experiments on action and
image categorization datasets. D3 is also efficient, and the
hybrid D3+FV representation has achieved the best results
among the compared individual and hybrid methods.

In the same spirit as D3, we plan to combine D3 and FV
in a principled way, which will add discriminative
perspectives to FV and further reduce the computational cost
of the hybrid D3+FV representation. We will also
study how the benefits of VLAD can be utilized (e.g., when
CNN instance vectors are used).
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